ABSTRACT

The Wilcoxon Rank Sum (or Mann-Whitney) Test is among the most useful
and powerful of the non-parametric hypothesis tests. However, as with many
hypothesis tests, when a clear alternative hypothesis and corresponding
power analysis are not present, the practical interpretation of results
using this test suffers greatly. This paper presents and clarifies an
alternative suggested by E. L. Lehmann in 1953 and provides tables of
practical use which have not previously been calculated due to computational
difficulties.
On the Lehmann Power Analysis for the
Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum (or Mann-Whitney) Test is among the most useful and
powerful of the non-parametric hypothesis tests. However, as with many
hypothesis tests, when a clear alternative hypothesis and corresponding power
analysis are not present, the practical interpretation of results using this
test suffers greatly. This paper presents and clarifies an alternative
suggested by E. L. Lehmann in 1953 (Annals of Mathematical Statistics [7]) and
provides tables of practical use which have not previously been calculated due
to computational difficulties. This work has recently been applied to survey
data gathered for the US Army Logistics Center. (See reference [5].)

When sample sizes are small, and a power analysis is not available, one may
fail to reject the null hypothesis when the true state of nature is very
different from what is stated in the null hypothesis. With a small sample size
and small α, it may be impossible to reject H0. Further, when sample sizes are
very large, the null hypothesis may be rejected at a very small significance
level when actually the null hypothesis is so nearly true that it is close
enough for all practical purposes. Taken to the extreme, with infinite sample
sizes, the attained significance level will be zero, even when there is only a
very small, but finite, difference between H0 and the true state of nature.
Thus significance level can be very misleading if used alone.

When a null and a definitive alternative hypothesis can both be stated, and
probability distributions found under each, the results of a hypothesis test
can be stated similarly to a confidence interval if the "point estimate" from
the observed values falls between the two hypotheses. In the case of the
Wilcoxon Rank Sum Test, only one alternative hypothesis has been well developed
and will be presented here. Due to the nature of this test, however, even if
the evidence strongly indicates that the true state of nature is not bounded
between this alternative and the null hypothesis, this power analysis can still
be used to obtain a reasonable estimate of what the actual state of nature
happens to be. (In the case of the multiple-sample Westenberg-type tests of
reference [4], an alternative must be picked such that the true state of nature
is indicated to be bounded by the null and alternative hypotheses. Fortunately,
that is not the case here, nor was it the case in reference [6], which
is a multi-sample test.)

Consider that the null hypothesis, H0, of the Wilcoxon Rank Sum Test
indicates that P(X<Y) = 1/2. That is, under H0, a value picked at random
from the Y population is larger than a value picked at random from the X
population with probability 1/2. Here an alternative hypothesis, H1, is
used such that P(X<Y) = 2/3. (The exact form of H1 is discussed in [7].)
Graph 1 illustrates a possible configuration for this alternative hypothesis.
For this example, consider that under H0, all observations are taken from a
N(r,s) distribution such as the N(5,1) shown on the left in graph 1, but under
H1, the Y sample comes from the N(r+0.61s, s) distribution, while the X sample
comes from the N(r,s) distribution.
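The shift of 0.61s quoted for the normal case can be checked numerically. The sketch below is illustrative only: the function name is an assumption, and it relies on the fact that for independent normal X and Y with a common standard deviation s, the difference Y - X is normal with standard deviation s times the square root of 2.

```python
from math import sqrt
from statistics import NormalDist

def prob_x_less_than_y(shift_in_sd):
    """P(X < Y) for independent X ~ N(r, s) and Y ~ N(r + shift_in_sd * s, s)."""
    # Y - X is normal with mean shift_in_sd * s and standard deviation s * sqrt(2),
    # so the probability depends only on the shift measured in units of s.
    return NormalDist().cdf(shift_in_sd / sqrt(2))

print(round(prob_x_less_than_y(0.0), 4))   # null hypothesis, no shift
print(round(prob_x_less_than_y(0.61), 4))  # shift used for the alternative
```

With no shift the probability is 1/2, and a shift of 0.61 standard deviations brings it very close to the 2/3 required by the alternative.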
Another example of a possible situation satisfying the alternative
hypothesis, H1, given approximately by comparing a gamma (4,1) with a
gamma (3,1), is illustrated by graph 2.

Note that the Wilcoxon Rank Sum Test is most sensitive to location, a
little sensitive to shape, but not to dispersion (except as it relates
proportionately to differences in location). Therefore, it is the differences
in location that are of primary importance in graphs 1 and 2.

In order to determine the probability of drawing a value from distribution
A which is larger than a simultaneously drawn value from distribution B, the
following may be used:

$$P = \int_{x=-\infty}^{\infty} f_B(x) \int_{t=x}^{\infty} f_A(t)\,dt\,dx$$

where f_A and f_B represent density functions.

For  the  case  where  A and  B are  both  gamma  distributions, 
For gamma (4,1) and gamma (3,1), P = 21/32 ≈ 0.656.
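The value 21/32 can be reproduced by evaluating the double integral numerically. The following is a minimal sketch, assuming unit scale and integer shape parameters so that the inner integral has a closed form; the function names, the integration range, and the step count are illustrative choices, not part of the original work.

```python
from math import exp, factorial

def erlang_pdf(x, shape):
    """Density of a gamma(shape, 1) distribution with integer shape."""
    return x ** (shape - 1) * exp(-x) / factorial(shape - 1)

def erlang_sf(x, shape):
    """P(T > x) for T ~ gamma(shape, 1) with integer shape (the inner integral)."""
    return exp(-x) * sum(x ** k / factorial(k) for k in range(shape))

def prob_a_exceeds_b(shape_a, shape_b, upper=60.0, steps=60000):
    """Midpoint-rule estimate of P(A > B), the outer integral over x."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += erlang_pdf(x, shape_b) * erlang_sf(x, shape_a)
    return total * h

print(round(prob_a_exceeds_b(4, 3), 4))
```

The result agrees with 21/32, which is close to, but slightly below, the 2/3 specified by the alternative hypothesis.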


For normal distributions, use the corresponding normal-theory expression, as in
the Church-Harris-Downton (C-H-D) method of missile motor safety testing [2].
(Note: This reference to the C-H-D method should not be construed as the
author's endorsement of this method for the purpose of missile motor safety
testing.)

The calculation of power under this alternative involves a summation over a
typically large number of products. Calculation of this value can become
extremely time consuming, even for a high speed computer. A program was
written for the author at White Sands Missile Range which will calculate these
exact values; however, in general, the sample sizes must be very small.
Recently, however, the author constructed a simulation which provides estimates
of the power for much larger sample sizes. A number of the "products"
mentioned earlier are calculated and the mean is computed. The number of
products involved in the exact calculation can be determined, and it is
multiplied by this mean. Comparison to values calculated exactly (when
practical), and a study of the sensitivity of the results to increased
replications, as well as comparison to other simulated values bounding the
results in the tables, led to the use of from 1 to 20 million replications to
simulate values for the tables found in this paper. (Work has been done,
reference [3], to determine the number of simulation replications needed under
less radical circumstances. Here, however, a larger number of replications
appears necessary.) (For n = m = 50, up to 35 million replications were used.
It appeared, however, that fewer replications using a number of different
seeds yielded mean answers which more quickly converged to reasonable results,
especially when using antithetic seeds.)
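For very small samples, the exact summation over products can be sketched as follows. This is not the White Sands program. It assumes that the alternative has Lehmann's form G = F^2 (which gives P(X<Y) = 2/3) and that the probability of each set of Y ranks is given by the product formula in Lehmann's 1953 paper; the function names and the brute-force enumeration are illustrative assumptions.

```python
from itertools import combinations
from math import comb, exp, lgamma, log

def config_prob(y_ranks, total, k=2):
    """Probability of one increasing tuple of Y ranks when G = F**k.

    Assumes Lehmann's (1953) product formula; k = 2 gives P(X < Y) = 2/3.
    """
    m = len(y_ranks)
    s = list(y_ranks) + [total + 1]          # s[m] plays the role of N + 1
    log_p = m * log(k) - log(comb(total, m))
    for j in range(1, m + 1):
        shift = j * (k - 1)
        log_p += lgamma(s[j - 1] + shift) - lgamma(s[j - 1])
        log_p += lgamma(s[j]) - lgamma(s[j] + shift)
    return exp(log_p)

def exact_pa_power_pb(n, m, rs, k=2):
    """Exact PA, power and PB at rank sum rs, by enumerating all configurations."""
    total = n + m
    null_p = 1.0 / comb(total, m)            # all configurations equally likely under H0
    pa = power = pb = 0.0
    for ranks in combinations(range(1, total + 1), m):
        p1 = config_prob(ranks, total, k)
        r = sum(ranks)
        if r >= rs:
            pa += null_p
            power += p1
        if r <= rs:
            pb += p1
    return pa, power, pb

pa, power, pb = exact_pa_power_pb(n=4, m=4, rs=24)
print(f"PA={pa:.3f} power={power:.3f} PB={pb:.3f}")
```

Each configuration contributes one product, and the number of configurations grows very quickly: with n = 11 and m = 19 there are already 54,627,300 of them, which is why full enumeration is practical only for small samples.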

In the tables, n is the sample size of the X sample, m is the sample size
of the Y sample, RS is the rank sum for which type I and type II error
probabilities are calculated, PA is the former of those probabilities, and PB
is the latter. Specifically, PA is the attained probability of making an error
if H0 is rejected, and PB is the attained probability of error if H1 is
rejected, both corresponding to the same RS value. RS is always calculated by
adding the ranks of the Y elements in the combined sample. Note that for
smaller sample sizes, power + PB is noticeably larger than unity due to the
discrete nature of this test. That is, the probability of obtaining exactly the
event observed (and no other) is non-zero.
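The definitions of RS, PA, and PB can be illustrated with a direct Monte Carlo sketch. It is not the author's simulation (which averages a sample of the exact "products"); it simply draws samples under each hypothesis. It assumes that the alternative is Lehmann's G = F^2, realized here by taking each Y as the larger of two uniform draws; the function names, seed, and replication count are illustrative.

```python
import random

def rank_sum_of_y(xs, ys):
    """Sum of the ranks of the Y values within the combined, sorted sample."""
    labelled = sorted([(v, False) for v in xs] + [(v, True) for v in ys])
    return sum(rank for rank, (_, is_y) in enumerate(labelled, start=1) if is_y)

def simulate_pa_pb(n, m, rs, reps, seed=1):
    """Estimate PA = P(RS >= rs | H0) and PB = P(RS <= rs | H1)."""
    rng = random.Random(seed)
    count_a = count_b = 0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        y_null = [rng.random() for _ in range(m)]
        # Under the assumed alternative G = F**2, Y is the larger of two uniforms.
        y_alt = [max(rng.random(), rng.random()) for _ in range(m)]
        count_a += rank_sum_of_y(xs, y_null) >= rs
        count_b += rank_sum_of_y(xs, y_alt) <= rs
    return count_a / reps, count_b / reps

pa, pb = simulate_pa_pb(n=11, m=19, rs=325, reps=20000)
print(f"PA ~ {pa:.3f}, PB ~ {pb:.3f}")
```

Because both PA and PB include the probability of the observed rank sum itself, the estimated power (the chance of RS at or above the observed value under H1) plus PB exceeds one, as noted above. The case n = 11, m = 19, RS = 325 is the one used in the example of this paper.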

Three significant digits are given for PA and only two for power and PB
simply because it takes fewer replications of the simulation to satisfactorily
obtain a value for PA than for the others.

From the annex to table 1, it is found empirically that if x is the size of
each of the two samples, and f_α(x) is the probability of a type II error
under the alternative used here, adjusted to correspond to a specific
significance level α, then, as a continuous representation of what is actually
a discrete process,

$$f_{0.10}(x) \approx \exp(-x/16)$$

for at least 3 < x < 40, and perhaps this approximation could
be trusted for x = 45 or larger. However, extrapolations are always more
dangerous than interpolations, so caution is advised for further extensions.

For α = 0.05,

f_0.05(x) ≈ exp(-x/[26exp …

for at least 4 < x < 40, and perhaps for x substantially larger. Using this
approximation, it is conjectured that for n = m = 66, when PA is approximately
0.05 (RS = 4751), then PB for this alternative is also approximately 0.05 and
the true state of nature would then quite safely be said to (probably) lie
between the null and alternative hypotheses. (At the 0.1 probability level for
PA and PB, this could be said when n = m = 37, and RS = 1507.) An
extrapolation to n = m = 66 is questionable, however, and further extrapolation
is not advised. Computer simulation for n = m = 50 indicates that for the top
curve (PA = 0.05) in Annex I to table 1, true values in this area for PB may be
somewhat smaller than this curve predicts. For PA = 0.10, PB values for large
n and m may be somewhat larger than predicted.
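The statement for the 0.1 level can be checked directly from the first empirical fit, exp(-x/16). The sketch below covers only that fit, since the expression for the 0.05 level is incomplete in the available text; the function names are illustrative.

```python
from math import ceil, exp, log

def pb_at_pa_010(x):
    """Empirical fit for PB at PA = 0.10 with equal sample sizes x."""
    return exp(-x / 16.0)

def size_for_pb(target_pb):
    """Smallest equal sample size at which the fit drops to target_pb."""
    return ceil(-16.0 * log(target_pb))

print(round(pb_at_pa_010(37), 3))  # PB predicted at n = m = 37
print(size_for_pb(0.10))           # sample size at which PB reaches 0.10
```

Inverting the fit in this way gives a quick planning figure, subject to the same caution about extrapolation beyond the fitted range.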
In Conover's book [1], an approximation is given to find RS for a given PA
value, (RS ≈ $m(m + n + 1)/2 + x_{1-\alpha}\sqrt{mn(n + m + 1)/12}$, where
$x_{1-\alpha}$ is from the table of the cumulative normal distribution.) The
two functions given earlier can be used to estimate PB values when PA = 0.10
or 0.05.
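The normal approximation for the critical rank sum is easy to evaluate without tables. In this sketch the function name is an assumption, and the normal quantile is taken from Python's standard library rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def critical_rank_sum(n, m, alpha):
    """Approximate RS of the Y sample (size m) at which PA equals alpha."""
    x_quantile = NormalDist().inv_cdf(1.0 - alpha)   # upper-tail normal point
    mean_rs = m * (m + n + 1) / 2.0                  # mean of RS under H0
    sd_rs = sqrt(m * n * (m + n + 1) / 12.0)         # standard deviation under H0
    return mean_rs + x_quantile * sd_rs

print(round(critical_rank_sum(n=11, m=19, alpha=0.10), 1))
print(round(critical_rank_sum(n=66, m=66, alpha=0.05), 1))
```

For n = m = 66 at the 0.05 level the approximation falls just below the RS = 4751 quoted above.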

The final graphs, 3-7, are taken from work the author directed at White
Sands Missile Range in order to study this alternative for the Wilcoxon Rank
Sum Test with emphasis on simulation validation for missile flight simulations.
When comparing a very few live firings to a substantially larger number of
simulations for each scenario, it can be seen from these graphs that once one
sample is substantially larger than the other, increasing the larger sample
size further does very little to improve the power. These graphs are
continuous representations of what are actually discrete points. The values
for those points were calculated analytically as noted in the acknowledgements.
Finally, when n ≠ m, PB can be bounded using the exponential formulations
found earlier in this paper. If, for example, RS is such that PA = 0.1, and
x_1 is the smaller of n and m, and x_2 is the larger, then one has that
approximately exp(-x_2/16) < PB < exp(-x_1/16), with PB somewhat closer to
exp(-x_1/16), especially when x_1 << x_2.
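These bounds take one line each to compute. The sketch below assumes PA = 0.1, as in the text; the function name is illustrative.

```python
from math import exp

def pb_bounds(n, m):
    """Approximate lower and upper bounds on PB when PA = 0.1 and n differs from m."""
    smaller, larger = min(n, m), max(n, m)
    return exp(-larger / 16.0), exp(-smaller / 16.0)

low, high = pb_bounds(n=11, m=19)
print(f"{low:.2f} < PB < {high:.2f}")
```

The sample sizes used here are those of the example in this paper.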

For larger sample sizes than are handled here, parametric methods may be
used. However, in addition to the probability of error associated with any
conclusion drawn from a parametric test, there is the additional risk involved
in assuming the distributional forms used in such a test. Hypothesis tests
should also be used to study these distributional assumptions to provide a more
complete risk analysis.

EXAMPLE: 

Consider two sources of data, X and Y, where it is suspected that Y may
represent a population of larger location than X, but this is not clear. If 11
observations are taken from the X population, and 19 observations taken from Y,
then the critical value of the rank sum (RS) of the Y sample observations
within the combined sample which represents the point at which rejection of the
null hypothesis would occur using α = 0.10, is approximately

$$\begin{aligned}
RS &\approx m(m + n + 1)/2 + 1.2816\sqrt{mn(m + n + 1)/12} \\
   &= (19)(31)/2 + 1.2816\sqrt{(19)(11)(31)/12} \\
   &\approx 324.3
\end{aligned}$$

Therefore, if RS > 325, H0 would be rejected at the α = 0.10 level. However,
should RS = 325, and H0 not be rejected, then the probability of making a type
II error with respect to the alternative hypothesis illustrated in graphs 1 and
2 is approximately bounded by exp(-19/16) and exp(-11/16), so 0.30 < PB < 0.50.
Note that, from table 2, when PA = 0.099, PB(10,20) = 0.43. Using 4,000,000
replications in the program given in Appendix A, for m = 19, n = 11, and
RS = 325, resulted in PA = 0.100 and PB = 0.42.
APPENDIX A

FORTRAN CODE FOR SIMULATION:

"LEHMANN POWER ANALYSIS FOR THE WILCOXON RANK SUM TEST"
ADDENDUM


Multiple applications of this test can be used to compare two levels of a
factor under a number of conditions. If, for example, manufacturer A produces
a machine which is suspected to have higher reliability under most scenarios
than a similar machine made by manufacturer B, then under each of the γ
scenarios, m_i is the sample size of A's machines and n_i is the sample size
of B's machines, for i = 1 to γ. PA_i and PB_i can be calculated for each of
the scenarios. Consider 0 < a ≤ γ and 0 < b ≤ γ;

PA is the probability of a or more PA_i's being less than p_A
(i = 1, ..., γ), when H0 is true.

PB is the probability of b or more PB_i's being less than p_B
(i = 1, ..., γ), when H1 is true.

Therefore,

$$PA = \sum_{i=a}^{\gamma} \binom{\gamma}{i} p_A^{\,i} (1 - p_A)^{\gamma - i}$$

and

$$PB = \sum_{i=b}^{\gamma} \binom{\gamma}{i} p_B^{\,i} (1 - p_B)^{\gamma - i}$$

p_A and p_B are chosen to be reasonable considering sample sizes for each of
the γ uses.

If PA/PB = 1 then the evidence shows that, in general, the true state of nature
is just as likely to be equivalent to H1 as H0.

If PA/PB = 2 then the evidence indicates that, in general, the true state of
nature is twice as likely to be equivalent to H0 as H1. If PA and
PB are small, then the indication is only that the true state of
nature is closer to H0 than H1, although possibly not very close
to either.
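The combined probabilities for multiple applications are upper tails of a binomial distribution: the chance that at least a given number of the scenarios fall below the chosen threshold. A minimal sketch follows, in which the function name and the numerical inputs (5 scenarios, thresholds of 0.10 and 0.20) are hypothetical values chosen only for illustration.

```python
from math import comb

def prob_at_least(count, scenarios, p):
    """P(at least `count` of `scenarios` independent events occur), each with probability p."""
    return sum(
        comb(scenarios, i) * p ** i * (1.0 - p) ** (scenarios - i)
        for i in range(count, scenarios + 1)
    )

# Hypothetical inputs: 5 scenarios, a = 2 with p_A = 0.10, b = 1 with p_B = 0.20.
overall_pa = prob_at_least(2, 5, 0.10)
overall_pb = prob_at_least(1, 5, 0.20)
print(f"PA = {overall_pa:.4f}, PB = {overall_pb:.4f}, PA/PB = {overall_pa / overall_pb:.3f}")
```

The ratio printed at the end is the quantity interpreted in the two statements above.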


(Note that another paper in this conference, "Numerical Validation of
Tukey's Criteria for Clinical Trials and Sequential Testing," by C. R. Leake,
also deals with this type of problem, and was of interest to this author.)
At this time, this methodology is being used to determine whether survey data
from a presumably less reliable source is compatible with a presumably superior
data source. Difficult-to-obtain data on U.S. Army warehousing activities have,
as one obvious characteristic, a very flat "peak." Therefore, a sample median
value can be changed drastically by the addition or deletion of one data point.

If the secondary data source proves to provide values distributed closely
enough to those of the primary source, the advantage of including this source
may outweigh the disadvantage. The current situation is more complex
than this. However, some results employing the methodology of this addendum
have been realized.


ADDENDUM  2 


Two approximations for the power of this test which apparently are good
for a wide range of normal alternative hypotheses are to be found in
E. L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks, Holden-Day,
1975. Although restricted to normal alternatives in the format in which they
are written, these approximations can be used to extend the tables given here
to larger n and m. The easier of the two approximations to apply, in its
simplest form, is found on page 73 of the above reference and is essentially as
follows:

$$\text{Power} \approx \Phi\left(\sqrt{\frac{3mn}{(m + n + 1)\pi}}\;\frac{\mu_A - \mu_B}{\sigma} - x_{1-\alpha}\right)$$

where Φ is the cumulative standard normal distribution, x_{1-α} is its
1 - α point, and in our case we have (μ_A - μ_B)/σ = 0.610.

Note that in the example in the main body of this paper (m = 19, n = 11),
this approximation gives power = 0.60, which is consistent with what
was shown earlier.
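The figure of 0.60 can be reproduced with a few lines of code. The sketch assumes that the approximation has the form Φ(√(3mn/((m+n+1)π)) · Δ/σ − x), where Δ/σ is the shift in standard deviation units and x is the upper α point of the standard normal distribution; the function name is illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

def lehmann_normal_power(n, m, alpha, shift_in_sd):
    """Approximate power against a normal shift alternative (assumed form)."""
    std_normal = NormalDist()
    x_quantile = std_normal.inv_cdf(1.0 - alpha)        # upper alpha point
    scale = sqrt(3.0 * m * n / ((m + n + 1) * pi))      # sample-size factor
    return std_normal.cdf(scale * shift_in_sd - x_quantile)

print(round(lehmann_normal_power(n=11, m=19, alpha=0.10, shift_in_sd=0.610), 2))
```

A power near 0.60 together with the simulated PB = 0.42 sums to slightly more than one, which is the behavior expected from the discreteness of the test.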


